A Stochastic Trust-Region Framework for Policy Optimization
Authors
Abstract
In this paper, we study a few challenging theoretical and numerical issues of the well-known trust region policy optimization for deep reinforcement learning. The goal is to find a policy that maximizes the total expected reward when the agent acts according to the policy. The trust region subproblem is constructed with a surrogate function coherent to the total expected reward and a general distance constraint around the latest policy. We solve the subproblem using a preconditioned stochastic gradient method with a line search scheme to ensure that each step promotes the model function and stays in the trust region. To overcome the bias caused by sampling in the function estimations under the random settings, we add the empirical standard deviation of the total expected reward to the predicted increase in the ratio used to update the trust region radius and to decide whether the trial point is accepted. Moreover, for a Gaussian policy, which is commonly used for continuous action spaces, the maximization with respect to the mean and the covariance is performed separately to control the entropy loss. Our analysis shows that the deterministic version of the proposed algorithm tends to generate monotonic improvement, and global convergence is guaranteed under moderate assumptions. Comparisons with state-of-the-art methods demonstrate the effectiveness and robustness of our method over robotic control and game playing tasks from OpenAI Gym.
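The noise-aware acceptance test is the distinctive ingredient here. Below is a minimal Python sketch of that idea only, not the authors' implementation: the function and variable names (accept_and_update_radius, reward_samples, eta, the grow/shrink factors) are illustrative assumptions, and the placement of the standard-deviation term follows the abstract's wording rather than the paper's formulas.

```python
import numpy as np

def accept_and_update_radius(actual_increase, predicted_increase,
                             reward_samples, radius,
                             eta=0.1, shrink=0.5, grow=2.0):
    """Noise-aware trust-region ratio test (illustrative sketch).

    actual_increase    : sampled change in total expected reward
    predicted_increase : increase predicted by the surrogate model
    reward_samples     : per-trajectory return estimates at the trial point
    radius             : current trust-region radius
    """
    # Empirical standard deviation of the sampled returns (scaled by
    # sample size); it compensates for the bias of stochastic estimates.
    sigma = np.std(reward_samples, ddof=1) / np.sqrt(len(reward_samples))

    # Ratio of the actual improvement to the noise-augmented prediction.
    rho = actual_increase / (predicted_increase + sigma)

    if rho >= eta:   # sufficient agreement: accept trial point, enlarge radius
        return True, radius * grow
    return False, radius * shrink   # otherwise reject and shrink
```

With sigma = 0 this reduces to the classical deterministic ratio test, consistent with the abstract's claim that the deterministic version tends to yield monotonic improvement.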
Similar resources
Trust Region Policy Optimization
We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks...
Stochastic derivative-free optimization using a trust region framework
This paper presents a trust region algorithm to minimize a function f when one has access only to noise-corrupted function values f̄ . The model-based algorithm dynamically adjusts its step length, taking larger steps when the model and function agree and smaller steps when the model is less accurate. The method does not require the user to specify a fixed pattern of points used to build local m...
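The step-length rule this blurb describes is the classical trust-region agreement test. A minimal sketch follows, under assumed thresholds (eta1, eta2) and update factors, none of which are taken from the paper:

```python
def adjust_radius(f_old, f_new, m_old, m_new, delta,
                  eta1=0.1, eta2=0.75, shrink=0.5, grow=2.0):
    """Agreement test on noisy function values f̄ (illustrative sketch).

    Takes larger steps when the model and the function agree, and
    smaller steps when the model is less accurate.
    """
    # Ratio of the observed decrease to the decrease predicted by the model.
    rho = (f_old - f_new) / max(m_old - m_new, 1e-12)
    if rho < eta1:        # poor agreement: shrink the trust region
        return delta * shrink
    if rho > eta2:        # good agreement: allow a larger step
        return delta * grow
    return delta          # moderate agreement: keep the radius
```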
Model-Ensemble Trust-Region Policy Optimization
Model-free reinforcement learning (RL) methods are succeeding in a growing number of tasks, aided by recent advances in deep learning. However, they tend to suffer from high sample complexity, which hinders their use in real-world domains. Alternatively, model-based reinforcement learning promises to reduce sample complexity, but tends to require careful tuning and to date has succeeded mainly ...
Stochastic Trust-Region Response-Surface Method (STRONG) - A New Response-Surface Framework for Simulation Optimization
Response surface methodology (RSM) is a widely used method for simulation optimization. Its strategy is to explore small subregions of the decision space in succession instead of attempting to explore the entire decision space in a single attempt. This method is especially suitable for complex stochastic systems where little knowledge is available. Although RSM is popular in practice, its current appl...
Derivative Free Trust Region Algorithms for Stochastic Optimization
In this article we study the following stochastic optimization problem. Let (Ω, F, P) be a probability space. Let ζ(ω) (where ω denotes a generic element of Ω) be a random variable on (Ω, F, P), taking values in the probability space (Ξ, G, Q), where Q denotes the probability measure of ζ. Suppose for some open set E ⊂ R, F : E × Ξ → R is a real-valued function such that for each x ∈ E, F(x, ·) :...
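The truncated setup typically continues to the standard expected-value objective; a hedged LaTeX rendering of that generic formulation (an assumption, not quoted from the paper):

```latex
% Generic stochastic optimization objective implied by the setup above:
\min_{x \in E} \; f(x) := \mathbb{E}_{Q}\big[ F(x, \zeta) \big]
               = \int_{\Xi} F(x, \xi)\, \mathrm{d}Q(\xi)
```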
Journal
Journal title: Journal of Computational Mathematics
Year: 2022
ISSN: 2456-8686
DOI: https://doi.org/10.4208/jcm.2104-m2021-0007